Reconfigurable Web wrapper agents for biological information integration
Abstract
Biological data of many kinds is transferred and exchanged in overwhelming volumes on the World Wide Web. How to rapidly capture, use and integrate this information to discover valuable biological knowledge is one of the most critical issues in bioinformatics. Many information integration systems have been proposed for integrating biological data. These systems usually rely on an intermediate software layer, called wrappers, to access the connected information sources. Wrappers for Web data sources are often hand-coded to accommodate the differences between Web sites. However, hand-coding a Web wrapper requires substantial programming skill and is time-consuming, and the resulting wrapper is hard to maintain. This paper provides a solution for rapidly building software agents that can serve as Web wrappers for biological information integration. We define an XML-based language called WNDL, which provides a representation of a Web browsing session. A WNDL script describes how to locate the data, extract the data and combine the data. By executing different WNDL scripts, a user can automate virtually all types of Web browsing sessions. We also describe IEPAD, a data extractor based on pattern discovery techniques. IEPAD allows our software agents to automatically discover the rules needed to extract the contents of a structurally formatted Web page. With a programming-by-example authoring tool, a user can generate a complete Web wrapper agent by browsing the target Web sites. We built a variety of biological applications to demonstrate the feasibility of our approach. The software is available at http://chunnan.iis.sinica.edu.tw/software.html or by contacting the authors.
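To make the idea of discovering extraction rules from a structurally formatted page more concrete, the following is a minimal Python sketch. It is not the IEPAD algorithm, whose details the abstract does not give. It swaps in a naive search for the most-repeated tag sequence, and every name in it (`tokenize`, `discover_pattern`, `extract_records`, `SAMPLE_PAGE`) as well as the sample page is a hypothetical illustration.

```python
import re

# Hypothetical sample page: a structurally formatted listing with placeholder data.
SAMPLE_PAGE = """
<ul>
<li><b>GENE_A</b> <i>Example protein A</i></li>
<li><b>GENE_B</b> <i>Example protein B</i></li>
<li><b>GENE_C</b> <i>Example protein C</i></li>
</ul>
"""

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into (kind, raw) pairs; kind is a tag such as '<li>' or 'TEXT'."""
    tokens = []
    for raw in TOKEN_RE.findall(html):
        if raw.startswith("<"):
            match = re.match(r"<\s*(/?\w+)", raw)
            if match:
                tokens.append(("<" + match.group(1).lower() + ">", raw))
        elif raw.strip():
            # Whitespace-only text between tags is ignored.
            tokens.append(("TEXT", raw.strip()))
    return tokens


def count_non_overlapping(kinds, pattern):
    """Return the start index of every non-overlapping occurrence of pattern."""
    size, i, hits = len(pattern), 0, []
    while i + size <= len(kinds):
        if kinds[i:i + size] == pattern:
            hits.append(i)
            i += size
        else:
            i += 1
    return hits


def discover_pattern(kinds, min_len=3, max_len=10):
    """Pick the repeated token sequence that covers the most tokens (naive search)."""
    best, best_score, seen = None, 0, set()
    for size in range(min_len, max_len + 1):
        for i in range(len(kinds) - size + 1):
            cand = tuple(kinds[i:i + size])
            # A record is assumed to start with an opening tag.
            if cand in seen or cand[0] == "TEXT" or cand[0].startswith("</"):
                continue
            seen.add(cand)
            hits = count_non_overlapping(kinds, list(cand))
            score = len(hits) * size
            if len(hits) >= 2 and score > best_score:
                best, best_score = cand, score
    return best


def extract_records(html):
    """Apply the discovered pattern and return the text fields of each record."""
    tokens = tokenize(html)
    kinds = [kind for kind, _ in tokens]
    pattern = discover_pattern(kinds)
    if pattern is None:
        return []
    records = []
    for start in count_non_overlapping(kinds, list(pattern)):
        span = tokens[start:start + len(pattern)]
        records.append([raw for kind, raw in span if kind == "TEXT"])
    return records
```

Calling `extract_records(SAMPLE_PAGE)` yields one two-field record per list item, without the record layout ever being specified by hand. A page with no repeated structure yields an empty list.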
Similar resources
Reconfigurable Web Wrapper Agents for Web Information Integration
In this paper, we presented a tool to exploit online Web data sources using reconfigurable Web wrapper agents. We described how these agents can be rapidly generated and executed based on the script language WNDL and extraction rule generator IEPAD. WNDL is an XML-based language that provides a representation of a Web browsing session. A WNDL script describes how to locate the data, extract the...
Reconfigurable Web Wrapper Agents
directly access the data. Web wrappers, however, must automate Web browsing sessions to extract data from the target Web pages so other applications can process that data. Each Web site has its own set of links, layout templates, and syntax. You could, in a brute-force solution, program a wrapper for each browsing session. However, such wrappers are sensitive to Web site changes and become diff...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Biological Data Extraction and Integration — A Research Area Background Study
My research field is highly diverse. It interweaves many different areas in information technology and bioinformatics. The system I propose to implement can automatically locate, understand, and extract online biological data independent of the source and also make it available for Semantic web agents. This research field requires background knowledge from (1) Information Extraction, (2) Schema...
A Web Service Based Framework for Information Integration of the Process Industry Systems
Many process industry subsystems were developed separately. They do not collaborate efficiently, which makes information integration difficult. In this paper, a web service based framework is presented to address this problem, in which every process industry subsystem is described as a web service by a web service wrapper and registered in the UDDI register centre, according to different i...
Journal title:
- JASIST
Volume 56
Publication date: 2005